Introduction

As a way of showcasing what we have learned so far, we were tasked with cleaning and exploring one of the data sets provided to us. Towards that end, we selected the data set pertaining to dogs that are available for adoption in the United States. Using this data, we will attempt to draw some big-picture insights about the population of sheltered dogs in the US.

We will be utilizing RStudio to perform cleaning and exploratory analysis on the data set. And, in order to make our work easier to understand, the steps we took will be broken down into the following:

  1. Required Packages
  2. Data Preparation and Cleaning
  3. Exploratory Analysis
  4. Summary

Required Packages

library("tidyverse")# To help tidy up the data
library("readr")    # To import .csv files in a more feature rich way
library("here")     # To make it easier to work collaboratively on the project
library("dplyr")    # For data/dataframe manipulation
library("usmap")    # US map plots
library("ggplot2")  # Data visualization
library("prettydoc")# document themes for R Markdown
library("DT")       # used for displaying R data objects (matrices or data frames) as tables on HTML pages

Data Preparation and Cleaning

As the data was provided to us, the source is on our class’ Canvas page

The data is broken down into three .csv files:

The data was collected on 9/20/2019 from various dog adoption organizations across the US.

First, setting our work directory and retrieving the data

here::i_am("Final.Rproj")
here() # set current directory to top-level of project 

# load dog_adoptable, dog_descriptions and dog_destination
dog_adoptable <- read_csv("data/raw/dog_adoptable.csv")
dog_descriptions <- read_csv("data/raw/dog_descriptions.csv")
dog_destination <- read_csv("data/raw/dog_destination.csv")

# loading the variable information for each table
dog_adoptable_variable_table <- read_csv("data/raw/dog_adoptable_variable_table.csv")
dog_descriptions_variable_table <- read_csv("data/raw/dog_descriptions_variable_table.csv")
dog_destination_variable_table <- read_csv("data/raw/dog_destination_variable_table.csv")


Now we can see the variable descriptions for the data we’ve loaded.


dog_adoptable variables and descriptions
variable class description
location character The full name of the US state or country
exported double The number of adoptable dogs available in the US that originated in this location but were available for adoption in another location
imported double The number of adoptable dogs available in this state that originated in a different location
Total double The total number of adoptable dogs available in a given state.
inUS logical Whether or not a location is in the US or not. Here, US territories will returnFALSE


dog_adoptable variables and descriptions
variable class description
id double The unique identification number for each animal.
org_id character The unique identification number for each shelter or rescue.
species character Species of animal.
breed_primary character The primary (assumed) breed assigned by the shelter or rescue.
breed_secondary character The secondary (assumed) breed assigned by the shelter or rescue.
breed_mixed logical Whether or not an animal is presumed to be mixed breed.
breed_unknown logical Whether or not the animal’s breed is completely unknown.
color_primary character The most prevalent color of an animal.
color_secondary character The second most prevalent color of an animal.
color_tertiary character The third most prevalent color of an animal.
age character The assumed age class of an animal (Baby,Young,Adult, orSenior).
sex character The sex of an animal (Female,Male, orUnknown).
size character The general size class of an animal (Small,Medium,Large,Extra Large).
coat character Coat Length for each animal (Curly,Hairless,Long,Medium,Short,Wire).
fixed logical Whether or not an animal has been spayed/neutered.
house_trained logical Whether or not an animal is trained to not go to the bathroom in the house.
declawed logical Whether or not the animal has had its dewclaws removed.
special_needs logical Whether or not the animal is considered to have special needs (this can be a long-term medical condition or particular temperament that requires extra care).
shots_current logical Whether or not the animal is up to date on all of their vaccines and other shots.
env_children logical Whether or not the animal is recommended for a home with children.
env_dogs logical Whether or not the animal is recommended for a home with other dogs.
env_cats logical Whether or not the animal is recommended for a home with cats.
name character The animal?s name (as given by the shelter/rescue).
tags character Any tags given to the dog by the shelter rescue (pipe|separated).
photo character The URL to the animal?s primary photo.
status character Whether the animal isadoptableor not.
posted character The date that this animal was first listed on a local website .
contact_city character The rescue/shelter?s listed city.
contact_state character The rescue/shelter?s listed state.
contact_zip character The rescue/shelter?s listed zip code.
contact_country character The rescue/shelter?s listed country.
stateQ character The state abbreviation queried in the API to return this result .
accessed double The date that this data was acquired from the PetFinder API.
type character The type of animal.
description character The full description of an animal, as entered by the rescue or shelter. This is the only field returned by the V1 API.


dog_adoptable variables and descriptions
variable class description
Id double The unique identification number for each animal
contact_city character The rescue/shelter’s listed city
contact_state character The rescue/shelter’s listed State
Description character The full description of each animal as entered by the rescue/shelter
Found character Where the animal was found.
Manual character .
Remove logical Animal removed from location
still_there logical TRUE/FALSE- Whether the animal is still located in their origin location and will be transported to their final destinationafteradoption.


Prepping to clean

Now, after reviewing the data set’s variables and their meanings we had to decide how to clean them. We found that rather than going through and cleaning ever single variable, it worked better for us to clean the data while keeping in mind what questions we wanted to be able to answer using the data.


Some of the questions we wanted to answer were:

  • How many adoptable dogs were there in each state?
  • What is the most common breed of the adoptable dogs in each state?
  • How long has each dog been up for adoption?


Figuring out what questions we wanted to ask first helped us shave down the number of variables we needed to keep (and clean) in the data set.


Cleaning the data

dog_adoptable
# dog_adoptable

# update field inUS to snake_case in_us
dog_adoptable <- rename(dog_adoptable, in_us = inUS)

# filter to just true for in_us
dog_adoptable <- filter(dog_adoptable, in_us == TRUE)

# drop field in_us as all records have same value
dog_adoptable <- select(dog_adoptable, !in_us)

# replace all NA with 0
dog_adoptable <- mutate_all(dog_adoptable, ~replace(., is.na(.), 0))

# rename location to state
dog_adoptable <- rename(dog_adoptable, state = location)


     Starting with the dog_adoptable data, we renamed the ‘inUS’ variable to in_us, however, since all the observations are located in the US we ended up discarding the ‘in_us’ column. We also renamed the ‘location’ variable to ‘state’ as we felt that title more accurately reflected the values in the column. And, while there are outliers in ‘exported’, ‘imported’ and ‘total’, we found them all of them to be believable values.


dog_destination
# dog_destination (removed)


      The dog_destination data was by and far the worst one. It contained a minimal amount of additional information, and the information it did contain was contradictory. For example, the ‘contact_city’ and ‘contact_state’ would contradict where the ‘found’ variable, and the variables ‘remove’ and ‘still_there’ did not reflect the dog’s change of location.

Not only this, but the values in the ‘found’ column were inconsistent. It lists countries, counties, cities, and nonsensical values such as ‘Sunday 10am’ or ‘Glaucoma’. As a result, we decided to drop the table completely.


dog_descriptions
# dog_descriptions

# drop stateQ as is only "The state abbreviation queried in the API to return this result " 
dog_descriptions <- select(dog_descriptions, !stateQ)

# drop status field as all are dogs adoptable
dog_descriptions <- select(dog_descriptions, !status)

# drop species field as all are dogs so adds no value.
dog_descriptions <- select(dog_descriptions, !species)

# drop type field as all are dogs so adds no value.3 are NA but confirmed they are dogs with their description
dog_descriptions <- select(dog_descriptions, !type)

# drop photo as all are NA
dog_descriptions <- select(dog_descriptions, !photo)

# drop name as useless
dog_descriptions <- select(dog_descriptions, !name)

# drop tags as useless
dog_descriptions <- select(dog_descriptions, !tags)

# drop description as useless
dog_descriptions <- select(dog_descriptions, !description)

# drop declawed as all are NA
dog_descriptions <- select(dog_descriptions, !declawed)

# drop contact_country as all are in the US, some have state or zip here by error
dog_descriptions <- filter(dog_descriptions, contact_country == "US")
dog_descriptions <- select(dog_descriptions, !contact_country)


      Here we went through dog_descriptions and dropped columns that either contained information that we did not need, or had no useful values in them. For example, all of the values for ‘declawed’ were ‘NA’ so we dropped the column.


dog_destination (continued)
dog_descriptions <- mutate(dog_descriptions, posted_date = as.Date(posted))
dog_descriptions <- mutate(dog_descriptions, accessed_date = as.Date(accessed, "%d/%m/%Y"))
dog_descriptions <- mutate(dog_descriptions, days_in_shelter = 
                             as.numeric(difftime(dog_descriptions$accessed_date, 
                                                 dog_descriptions$posted_date , units = c("days"))))
dog_descriptions <- select(dog_descriptions, !c(posted, posted_date, accessed, accessed_date))

# fix zip na, all are in Boston 02108
dog_descriptions <- mutate_at(dog_descriptions, vars("contact_zip"), ~replace(., is.na(.), 02108))

# TODO: why is this not padding zip with 0s?
dog_descriptions <- mutate(dog_descriptions,
          zip = str_pad(string = contact_zip,
                              width = 5,
                              side = "left",
                              pad = "0"))

#rename to city state and zip
dog_descriptions <- select(dog_descriptions, !contact_zip)
dog_descriptions <- rename(dog_descriptions, city = contact_city)
dog_descriptions <- rename(dog_descriptions, state = contact_state)

# state abbreviation to full names
state.abb.and.name <- tibble(state.abb, state.name)

# left join for state information
dog_descriptions <- left_join(dog_descriptions, state.abb.and.name, by = c("state" = "state.abb"))


In this section we renamed, and mutated some variables.


dog_destination (continued)
# breed_secondary
dog_descriptions$breed_secondary[is.na(dog_descriptions$breed_secondary)] <- "NONE / UNKNOWN"

# color_primary
dog_descriptions$color_primary[is.na(dog_descriptions$color_primary)] <- "OTHER"

# color_secondary
dog_descriptions$color_secondary[is.na(dog_descriptions$color_secondary)] <- "NONE / OTHER"

# color_tertiary
dog_descriptions$color_tertiary[is.na(dog_descriptions$color_tertiary)] <- "NONE / OTHER"

# coat
dog_descriptions$coat[is.na(dog_descriptions$coat)] <- "OTHER"


      Here we took care of the remaining ‘NA’ values in the character type columns. As opposed to having colors or breeds being listed as ‘NA’ we decided that a value of ‘NONE / UNKOWN’, ‘NONE / OTHER’, or ‘OTHER’ were more appropriate.
      The reasoning behind our logic was that a dog could be purebred or a mix of different breeds with no distinguishable second primary breed. We applied the same logic to the dog coloration and coat since sometimes there is no obviously dominant coloration or appropriate descriptor for what the dog might be.


Below are small chunks of the cleaned data


dog_adoptable after cleaning


dog_descriptions after cleaning


Exploratory Analysis

      After cleaning the data, we reconsidered the questions we wanted to ask of it.
  • How many adoptable dogs were there in each state?
  • What is the most common breed of the adoptable dogs in each state?
  • How long has each dog been up for adoption?
      We felt that our first two questions would provide quality information on how the overall population of adoptable dogs was broken up in each state. Our third question however, only addressed dogs at an individual level. While knowing how long each dog has been up for adoption is good information to have, we felt that we could do more. Instead of just extracting that information, we decided to put it to use and see what we could get out of it.


Our finalized questions for the exploratory analysis are:


How many adoptable dogs are there in each state?

This question is to give us the initial idea of the spread across the nation. This information can be found using the dog_adoptable data set. Using the state and total columns, we were able to build the below map. Below is a heat map of the United states to visualize the availability of dogs in shelters per state. The lighter blue the state represents a greater number of available dogs, where the darker blue represents fewer available dogs.

# Heat-map of adoptable dogs per state
plot_usmap(data = dog_adoptable, values = "total", color = "red") + 
  scale_fill_continuous(name = "Dogs Adoptable", label = scales::comma) + 
  theme(legend.position = "right") +
  ggtitle("Adoptable Dogs In Each State") +
  theme(plot.title = element_text(hjust = 0.5))


Where the above map may be nice to get an idea if regionality is at play, this table will let you know the exact count for each state.

# Table of adoptable dogs per state
adoptable_table <- dog_adoptable %>% 
  select(state, total) %>%
  arrange(desc(total))
datatable(adoptable_table,options = list(scrollX=TRUE, pageLength=10))


What is the most common breed up for adoption in each state?

This question was more complex than the first. It required grouping and join across multiple data sets, which become more complicated as they didn’t contain all complimentary data of the other (e.g.: D.C.). The code below was done this way to maintain each observation in he case of a tie instead of dropping one. Note that there are 52 observations because there are two ties, in South Dakota and Montana.

# Table of adoptable dogs per state
# State and count of max breed
temp1 <- dog_descriptions %>%
  select(breed_primary, state.name) %>% 
  group_by(state.name, breed_primary) %>% 
  summarise(count = n()) %>% 
  group_by(state.name) %>% 
  summarise(count = max(count))
#remove DC as is not in all data sets
temp1 = na.omit(temp1)

# state with all breed count
temp2 <- dog_descriptions %>%
  select(breed_primary, state.name) %>% 
  group_by(state.name, breed_primary) %>% 
  summarise(count = n())

# remove all that aren't the max count for tat state and adds breed
most_breed_per_state <-  left_join(temp1, temp2, by=c("state.name", "count"))
colnames(most_breed_per_state)[which(names(most_breed_per_state) == "state.name")] <- "state"
most_breed_per_state <- left_join(most_breed_per_state, dog_adoptable, by="state")
# get percent and reorder
most_breed_per_state <- mutate(most_breed_per_state, percent = count/total*100)
most_breed_per_state <- select(most_breed_per_state, -c("exported", "imported"))
most_breed_per_state <- most_breed_per_state[, c(1, 3, 2, 4, 5)]
# tie in SD and MT
datatable(most_breed_per_state,options = list(scrollX=TRUE, pageLength=10))

With this data we can make assumptions of what is available in certain states, and therefore assume the inverse that those breeds are not in demand there. With this knowledge you can make recommendations for transferring dogs to other states to increase their chance of adoption. (Note that this is an imperfect method of determining demand as we don’t also have data on dogs that have been adopted in the past, just what are available now.)


What is the average length of time each breed spends up for adoption?

      In dog_descriptions, there are four columns relating to breed. These columns are: breed_primary, breed_secondary, breed_mixed, and breed_unknown. The first two are character values, and the last two are boolean values. We opted to utilize the breed_primary column since it seemed like the best fit for our purposes.


Note: Since the breed_primary column contains 216 different breeds, we will restrict our scope and only look at the ten most common breeds that are up for adoption.


First things first, is a bit of setup. We need to get a count of how many dogs of each breed there are in the shelters.

# Count each dog in the shelter grouped by breed_primary
breed_count <- dog_descriptions %>%
  group_by(breed_primary) %>%
  count(breed_primary) %>%
  rename("in_shelter" = n) %>%
  arrange(desc(in_shelter))

Result:


Next, we need to find the average amount of time each breed spends in up for adoption.

# Find the average amount of time each breed spends in the shelter
average_shelter_time <- dog_descriptions %>%
  group_by(breed_primary) %>%
  summarize(breed_average_shelter_time = round(mean(days_in_shelter), 0))

Result:


Now to join the two tables.

# Join the previous two tables, arrange them (desc) and get the first ten
common_breed_summary <- 
  inner_join(breed_count, average_shelter_time, by = "breed_primary") %>%
  select(breed_primary, in_shelter, breed_average_shelter_time) %>%
  group_by(breed_primary) %>%
  arrange(desc(in_shelter)) %>%
  head(10)

Result:


While we’re at it, let’s grab the overall average amount of time that dogs spend up for adoption.

# Getting the average amount of time each dog spends in the shelter
overall_average <- round(mean(dog_descriptions$days_in_shelter))

Now we have something to relate our data to we’ll go ahead and visualize the information from the table and compare it to the average. We’ll be use a bar graph for our visualization.


Now we have a chart of the ten most common breeds as they relate to the average (which is represented by the black vertical line). But, while the chart is nice, it feels a bit lacking in numerical context. So we decided to create a table for that information.


With a little bit of math we can create a new column compared_to_average that shows how each breed’s average compares to the overall average as a percentage.

# Getting time spent in shelter as a percentage of overall average
common_breed_summary <- common_breed_summary %>%
  mutate(compared_to_average = 
           round(breed_average_shelter_time / overall_average * 100 - 100, 2))

Result:

      With this we now have the additional numerical context for the bar graph without cluttering up the visualization. With the overall average number of days a given dog will spend waiting to be adopted as our 100% mark, we now know how long on average each given breed spends in a shelter in comparison to it.


For example, of the ten most common breeds found in shelters across the US, ‘Mixed Breed’ dogs get adopted in 50.77% faster than the average. While ‘American Staffordshire Terrier’ breed dogs take an additional 72.31% of the average time to get adopted.


Summary of Analysis

The purpose of our analysis was to draw big-picture insights from the data set about shelter dogs in the United States. Towards that end we:

      Our resulting analysis provides a breakdown of information about the population of shelter dogs in the US. These insights include, the number of shelter dogs in each state, the most common breeds of shelter dogs in each state, and how long, on average, each breed spends waiting to be adopted.
      These insights paint a picture of the adoptable dogs in the US as a whole. However, the methods we used to extract this information are perfectly capable of being scaled down to the state, or even city level. And, while we did remove some data in the process of cleaning, the information that we did keep could be used to examine what factors contribute to how long a dog may remain in a shelter.
      If one were to further analyze this data set, they could determine what traits make a dog more desirable in different parts of the country. This means that shelters could work on developing those traits in their dogs, or could send dogs to a location where the traits they have are found to be more desirable. This could help minimize the amount of time dogs spend in shelters, and reduce the overall population of shelter dogs in the US.